
common/parser: add proper reasoning tag prefill reading#20424

Merged
pwilkin merged 14 commits into ggml-org:master from pwilkin:reasoning-prefill on Mar 19, 2026

Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Mar 11, 2026

This fixes the erroneous behavior of the autoparser, which ascribed thinking behavior to templates. As people rightly pointed out, some models have dynamic or hybrid reasoning: they can reason or not depending on switches, and even the template behavior can change because of this (e.g. inserting <think> in the assistant prefill after a "no_think" appears in a user message).

Therefore, the FORCED_OPEN and FORCED_CLOSED formats are gone. The parser now just detects models with tagged reasoning, i.e. an opening and a closing reasoning marker (DELIMITER was also deleted, since it is just the special case where the opening marker is empty). However, the parser checks the assistant prefill for those markers and appends them to the input for the grammar and the parser so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the <think> was added by the template or generated by the model.

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Fixes #20356
Fixes #20325
Fixes #20265

This also clears the ground for disabling grammar triggers inside reasoning loops in a subsequent PR, which would resolve #20260

@github-actions github-actions bot added the documentation, testing, examples and server labels on Mar 11, 2026
@aldehir
Contributor

aldehir commented Mar 11, 2026

Dumb question, why not find the start of the assistant message and prepend that?

I agree it would be easier to parse if we had a "prefill" of some sort that normalizes the input, such that we can handle the logic in the grammar and not through flags. However, if we're going this route, I would look into prepending the start of the entire assistant message. This would also open the door to parsing output from requests with an assistant prefill.

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Yeah, that would be the logical conclusion, but for now it's easier for me to just extract the reasoning markers, since finding the actual start of the assistant message is nontrivial.

@aldehir
Contributor

aldehir commented Mar 11, 2026

Qwen3.5 uses <think>\n\n</think>\n\n when thinking is disabled:

{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- else %}
{{- '<think>\n' }}
{%- endif %}

however,

      "reasoning_prefill": "<think></think>\n\n",

It probably doesn't matter for this model, but it is technically not adhering to the template.

@aldehir
Contributor

aldehir commented Mar 11, 2026

    {
      "id": 248045,
      "piece": "<|im_start|>"
    },
    {
      "id": 74455,
      "piece": "assistant"
    },
    {
      "id": 198,
      "piece": "\n"
    },
    {
      "id": 248068,
      "piece": "<think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    },
    {
      "id": 248069,
      "piece": "</think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    }

Maybe set reasoning_prefill from the start of the opening tag to the end of the prompt?

@aldehir
Contributor

aldehir commented Mar 11, 2026

finding the actual start of the assistant message is nontrivial.

Run the template once with add_generation_prompt = false, capture the size, run again with true, extract the string content that spans the delta? I think that would work in most cases.

@pwilkin
Contributor Author

pwilkin commented Mar 12, 2026

That usually works, yeah 😀 I can try that and see what the results are (this is what calculate_diff_split from the analyzer does, BTW). I'm just worried about some weird edge cases.

@bsdice

bsdice commented Mar 14, 2026

Nice patch! With the model https://huggingface.co/mradermacher/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking-GGUF, this patch fixes the webui getting confused on /think and failing to split the reasoning and generation parts correctly. Build llama.cpp-cuda-git-b8334.r9.710878a7dd-1.

@pwilkin pwilkin force-pushed the reasoning-prefill branch from 3bfb08f to 4083259 Compare March 14, 2026 14:49
@pwilkin
Contributor Author

pwilkin commented Mar 14, 2026

@aldehir changed the prefill extraction behavior to the differential one you mentioned.

common/chat.h Outdated
std::string grammar;
bool grammar_lazy = false;
bool thinking_forced_open = false;
std::string prefill;
Contributor

Should we name this generation_prompt? It lines up with the add_generation_prompt flag.

Comment on lines +71 to +95
bool clear_reasoning_start = false;
if (inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE &&
    autoparser.reasoning.mode != reasoning_mode::NONE &&
    !autoparser.reasoning.end.empty()) {
    const auto & r_start = autoparser.reasoning.start;
    const auto & r_end   = autoparser.reasoning.end;
    auto r_end_t   = trim_trailing_whitespace(r_end);
    auto r_start_t = trim_trailing_whitespace(r_start);

    if (!r_start_t.empty()) {
        auto start_pos = prompt_to_search.rfind(r_start_t);
        if (start_pos != std::string::npos) {
            std::string from_start = prompt_to_search.substr(start_pos);
            auto fs_trimmed = trim_trailing_whitespace(from_start);

            if (string_ends_with(fs_trimmed, r_end_t)) {
                data.prefill = r_start + r_end;
            } else if (string_ends_with(fs_trimmed, r_start_t)) {
                data.prefill = from_start;
            } else {
                clear_reasoning_start = true;
            }
        }
    }
}
Contributor

@aldehir aldehir Mar 14, 2026

So my understanding is: we have a generation prompt G, and we can create a parser that accepts G[0:min(G.size(), G.index_of(reasoning_start))] + (reasoning_start + reasoning + reasoning_end)? + .... Then we can do away with all the trim logic.

The benefit is that now the parser can properly parse assistant prefill from the user, since the parser starts from the beginning of the assistant message.

I see that Mistral's templates have no generation prompt, so G = "". But this is fine, because the model emits the [THINK] tag. So the above still works.

Contributor Author

This workaround is mostly for Apriel, which has a delimited thinking format and inserts a header like "Thinking chain starts here: " as the generation prompt, which acts as a quasi-reasoning marker that we want to strip.

@pwilkin
Contributor Author

pwilkin commented Mar 15, 2026

@aldehir okay, that rewrite ended up being a bit bigger than I expected... but it's exactly the algorithm you mentioned now.

@aldehir
Contributor

aldehir commented Mar 15, 2026

Oh jeez, well it's <100 net LOC. I'll give it a whirl.

@pwilkin pwilkin requested review from a team as code owners March 15, 2026 15:02
@pwilkin
Contributor Author

pwilkin commented Mar 15, 2026

@aldehir happy to report I added another nice piece of code to make it work correctly with grammars / schemas :)

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 19, 2026

With OpenCode and Unsloth's Qwen3.5 35B, I now seem to be getting all think blocks in the response, with </think> attached.

(screenshot)

This is with "show thinking" off.

Wasn't happening yesterday, so I'm guessing it's related?

@pwilkin
Contributor Author

pwilkin commented Mar 19, 2026

@strawberrymelonpanda argh.

Can you check with the vanilla template?

@strawberrymelonpanda
Contributor

Willing to; is there a command I can use to bypass Unsloth's template, or do I need a different model?

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 19, 2026

@pwilkin looks like it's happening on Ubergarm's quant too.

(screenshot)

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 19, 2026

@pwilkin I rolled back to the commit right before this PR, and I no longer see the thinking content and tags.

With show thinking turned ON at commit c125883, using Ubergarm's quant:

(screenshot; turned on since otherwise there'd be nothing to show it's working)

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda can you give the exact server command?

I'm trying to repro on various Qwen3.5 4B quants but all seems correct so far.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 20, 2026

GGML_CUDA_GRAPH_OPT=1 \
LLAMA_SERVER_SLOTS_DEBUG=1 \
llama-server \
--seed 1 \
--threads 8 \
--host 127.0.0.1 \
--port <port> \
--props \
--no-mmap \
--direct-io \
--models-preset ./presets.ini \
--models-max 1 \
[*]
spec-type = ngram-mod
spec-ngram-size-n = 24
draft-min = 48
draft-max = 64
parallel = 1
flash-attn = on
fit = on

; Qwen
[qwen35-35b]
load-on-startup = true
model = <path>/Qwen3.5-35B-A3B-Q4_K_S (unsloth).gguf
;model = <path>/Qwen3.5-35B-A3B-Q4_0 (ubergarm).gguf
mmproj = <path>/Qwen3.5-35B-A3B-mmproj-bf16.gguf
fit-target = 1800
ctx-size = 80000
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.00

(Tested with both models)
OpenCode version 1.2.26

Start opencode in the llama.cpp folder, at commit c1b9116 (master).

I ran the command

Add a sleep API endpoint that triggers the result of sleep_idle_seconds

(I was seeing if it could, for local use, because I'd like this without a full unload; I think there are some differences?)

For this specific command, I first see thinking work twice:

(two screenshots)

Then it starts failing:

(two screenshots)

I can't promise it's related, but my OpenCode is set to manual update and hasn't changed, and after rolling back the </think> tags stopped entirely.


The same commands on c125883

(three screenshots)

and continues on throughout the task without issue.

pwilkin added a commit that referenced this pull request Mar 20, 2026
…on) (#20777)

* chat : fix out_of_range crash in throw path (#20424 regression)

#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.

Test crashes on unpatched master, passes with fix:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat

* Update test-chat.cpp

* Update test-chat.cpp

* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 20, 2026

@pwilkin I'm using a deterministic seed, --seed 1, so with any luck maybe you can reproduce it.

Not sure how much the RNG varies based on hardware though.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

Can you please try without speculative decoding as well? (Got to go to sleep now, but I'll try to repro tomorrow.)

@strawberrymelonpanda
Contributor

Same results without

spec-type = ngram-mod
spec-ngram-size-n = 24
draft-min = 48
draft-max = 64

For that command, works twice, fails twice, etc.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 20, 2026

If you're not familiar with OpenCode, here's an opencode.json config file that should work with llama.cpp. It's a stripped-down version of my own; I don't think I removed anything important, just some permissions and extra models.

opencode.json
{
  "$schema": "https://opencode.ai/config.json",  

  "share": "disabled",
  "autoupdate": false,
  "enabled_providers": ["llama-cpp"],
  
  "formatter": false,
  "lsp": false,  

  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 10000
  },

  "permission": {
    "webfetch": "ask",
    "websearch": "ask",
    "bash": {
      "*": "ask"
    }
  },

  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:<port>/v1"
      },
      "models": {
        "qwen35-35b": {
          "name": "qwen35-35b",
          "modalities": {
            "input": [ "text", "image" ],
            "output": [ "text" ]
          },
          "limit": {
            "context": 80128,
            "output": 99999
          }          
        }
      }
    }
  },

  "model": "llama-cpp/qwen35-35b",
  "small_model": "llama-cpp/dummy"
}

"small_model": "llama-cpp/dummy" is there because of a recent issue where, unless a small model was set, OpenCode would send data to their servers to get session titles from GPT Nano. Giving it a dummy model just makes it fail and fall back to a timestamp.

You can remove this line, but as of recent commits it'll cause the large model to try to make the title instead, which can be a significant delay, so I keep it for a fast fail.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda I test everything on OpenCode, which is why I'm surprised to see this arise. I'll try to repro on the smaller models first, but I have a 35B quant lying around somewhere if I can't.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda got a repro, looking into it.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

Aaaaand it's gone... reproduced it once, set up an MITM proxy, and now it's gone :P

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* Implement proper prefill extraction

* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp

* Update tools/server/server-task.cpp

* refactor: move grammars to variant, remove grammar_external, handle exception internally

* Make code less C++y

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
…regression) (ggml-org#20777)

* chat : fix out_of_range crash in throw path (ggml-org#20424 regression)

ggml-org#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.

Test crashes on unpatched master, passes with fix:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat

* Update test-chat.cpp

* Update test-chat.cpp

* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda happy to report I found the cause :)

Can you please check if #20825 resolves it?

@strawberrymelonpanda
Contributor

Looks good. 👍

(screenshot)


Labels

documentation (Improvements or additions to documentation), examples, python (python script changes), server, testing (Everything test related)


7 participants